Skip to content

feat: Migrate the annotation pipeline onto the allele model + append-only event log#783

Draft
bencap wants to merge 23 commits into
feature/bencap/741/register-all-alleles-with-clingenfrom
feature/bencap/742/annotation-pipeline-allele-refactor
Draft

feat: Migrate the annotation pipeline onto the allele model + append-only event log#783
bencap wants to merge 23 commits into
feature/bencap/741/register-all-alleles-with-clingenfrom
feature/bencap/742/annotation-pipeline-allele-refactor

Conversation

@bencap

@bencap bencap commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

Builds on the deduplicated allele / valid-time mapping model from #741 (and #739). This PR moves the external-annotation pipeline onto that model and replaces per-variant annotation status with an append-only event log.

Two themes:

  1. Annotation jobs link to alleles, not MappedVariant. gnomAD, VEP, and ClinVar now resolve and link against the deduplicated allele space (authoritative and RT-derived alleles), with version-keyed refresh and idempotent re-runs. CAR/LDH already ran on alleles.
  2. Status becomes an append-only AnnotationEvent log. A single log across the pipeline records disposition + why + when per subject; "current" is derived (v_current_annotation_events). This can distinguish confirmed-absence from never-checked — the gap VariantAnnotationStatus couldn't express.

Still parallel-tables: new link tables and the event log are written for new runs; legacy MappedVariant columns and association tables stay frozen-serving. Backfill/read-cutover/drops are deferred to #775/#747.

Annotation jobs onto the allele model

  • gnomADgnomad_allele_links (one live link per allele; supersede-on-change; version-keyed skip with a force bypass).
  • VEPvep_allele_consequences (collapsed record+link; keyed on the Ensembl release; supersede only on a changed consequence so categorical values don't churn each release). HTTP rejections are routed to Variant Recoder rather than failing the batch.
  • ClinVarclinical_controls renamed to clinvar_controls (ORM/table/constraint only; the ClinicalControl* serving contract is intentionally frozen) + multi-live clinvar_allele_links (one live link per archival release; get-or-create). Adds forward-captured, unserved clinvar_variation_id.
  • All consumers link every current allele via the shared group_alleles_for_annotation primitive (one work-unit per allele); force_reregister re-registers existing CAIDs.
  • New lib/alleles.py::get_allele_translations — cross-layer equivalence via mapping-record co-membership, with as_of temporal support.

Append-only annotation event log

  • New annotation_event table + v_current_annotation_events view (latest-event-per-subject). AnnotationStatusManager rewritten to emit events; mapping, reverse-translation, gnomAD, VEP, ClinVar, and CAR/LDH all emit events from their resolution outcomes.
  • Records status, not value: disposition (present/absent/not_applicable/failed; pending = no event) + a granular per-domain reason. Events share the job's DB session (flushed-not-committed), so they're transactionally coupled to the annotation data they describe.
  • Adds a pipeline-tracking + event-aware annotations script.

Retired

  • Deleted the populate_hgvs_for_score_set and populate_variant_translations_for_score_set jobs (hgvs.py, variant_translation.py) — both DAG leaf nodes, no edge repair needed. Allele.hgvs_* is mapping-populated; the RT equivalence space replaces variant translations. lib/variant_translations.py + the variant_translations table stay frozen (serving-only).

Migrations

8 new (the three annotation link/consequence tables, the clinvar_controls rename, clinvar_variation_id, the event table + view, and an index on annotation_event.score_set_id) plus a rename of the existing alleles/mapping-records migration to match its contents. Single alembic head; all have working downgrades.

Testing

mypy src/ and ruff green. Updated/added coverage across tests/{lib,models,worker,routers} for each migrated job, the event model + view, and the rewritten status manager. (API/worker tests per project convention; UI out of scope.)

Deferred follow-ups

bencap added 23 commits June 22, 2026 11:45
…-keyed refresh

Replace MappedVariant-based gnomAD linkage with valid-time GnomadAlleleLink rows
keyed on the deduplicated Allele. Linking covers every current allele of a score
set (authoritative and RT-derived), so protein/coding score sets — whose genomic
allele is RT-derived — are no longer dropped; per-variant annotation status flows
through the _annotate_gnomad choke point against authoritative links only (interim
bandaid, the AnnotationEvent migration seam).

Refresh / idempotency model:
- One live link per allele (unique index on allele_id); a gnomAD version bump
  supersedes rather than accumulating one live link per version.
- Supersede only on change — an unchanged re-run leaves the live link untouched,
  so the valid-time history records no spurious boundary.
- Version-keyed skip avoids re-fetching alleles already current at the version; a
  force param bypasses the skip (re-ingestion / heal) without churning unchanged links.
- Per-variant status is a per-run audit event: created / preexisting / skipped.

Normalize CAIDs across the Athena join: the gnomAD Hail dump drops leading zeros
(CA025094 -> CA25094), so exact-string matching silently missed zero-padded CAIDs.

Extract group_alleles_for_annotation as the shared allele-grouping primitive for
allele-subject annotation jobs (adopted by gnomAD; VEP/ClinVar to follow).

Refs #742. Partially addresses #722 (CAID-completeness tracking there stays open).
Migrate the VEP functional-consequence job (Step 2 of the annotation
infrastructure migration) off MappedVariant onto the deduplicated allele
model.

- New ValidTime vep_allele_consequences table: a single allele-keyed row
  collapses record + link (the consequence is a scalar with no shared
  external entity, unlike gnomAD/ClinVar). One live consequence per allele
  via a partial unique index.
- Job runs over the score set's full allele set (authoritative + RT-derived)
  via get_alleles_for_score_set + the shared grouping primitive; VAS still
  fans only to authoritative variants (the bandaid seam).
- Version-key the refresh on the Ensembl release (/info/software): skip
  alleles already current, abort if the release can't be fetched. Supersede
  is value-keyed, not version-keyed — an unchanged consequence at a new
  release bumps source_version/access_date in place to avoid churning
  history. force bypasses the skip (e.g. after editing VEP_CONSEQUENCES).
- A no-result is treated as a non-answer, never a negative: held
  consequences are not retired on an empty/failed VEP fetch.
- Annotation links stay one-directional to Allele (no reverse back-ref).

Adds lib + job tests covering linkage, RT-derived scope, version skip,
in-place bump, supersede-on-change, no-result handling, and release-fetch
failure.
Step 3 of the #742 external-annotation migration. ClinVar linkage moves
off MappedVariant onto the deduplicated allele model.

- Rename clinical_controls -> clinvar_controls (ORM model + table +
  unique constraint). Internal only: the ClinicalControl* view models,
  the /clinical-controls serving endpoints, and the frozen
  mapped_variants_clinical_controls association are unchanged, since the
  view-model name is both the OpenAPI schema name and the record_type
  discriminator consumed by the UI.
- New clinvar_allele_links ValidTime table + ClinvarAlleleLink model.
  Multi-live: partial unique index (allele_id, clinvar_control_id) WHERE
  valid_to IS NULL, so an allele accumulates one live link per release.
- Refactor refresh_clinvar_controls onto get_alleles_for_score_set +
  group_alleles_for_annotation (payload = CAID, full allele scope). Links
  are get-or-create; a same-version re-resolution to a different control
  supersedes newest-wins (gap-free retire+insert) rather than leaving two
  live links. VAS writes funnel through the _annotate_clinvar choke point,
  fanned only to authoritative_variant_ids and version-scoped.
- Additively capture ClinVar's VariationID: nullable clinvar_variation_id
  column populated forward from the variant_summary TSV (parse degrades to
  None on archival schemas lacking the column). Unserved; the dedicated
  clinvar_variants remodel is deferred to the read-cutover.

Tests rewritten for the allele model, covering the multi-live link writes,
the version-scoped supersede guard, and the authoritative-only VAS fan-out.
Steps 4 & 5 of the #742 migration. Both jobs are redundant under the allele model:
Allele.hgvs_g/c/p are populated by the mapping job, and the reverse-translation
equivalence space (genomic/coding/protein alleles linked per MappingRecord) replaces
the ClinGen PA<->CA translation table.

- Remove populate_hgvs_for_score_set and populate_variant_translations_for_score_set
  from the pipeline DAG (both were leaf nodes — no dependency edges to repair), the
  worker registry (BACKGROUND_FUNCTIONS + STANDALONE_JOB_DEFINITIONS), and the
  external_services package exports.
- Delete the two job modules and the now-orphaned ClinGen HGVS helpers
  (extract_hgvs_from_ca_allele_data / extract_hgvs_from_pa_allele_data), used only by
  the HGVS job.
- Keep lib/variant_translations.py and the variant_translations table/model, marked
  FROZEN (serving-only) — they back old-model serving and are dropped at read-cutover.
- Delete the obsolete job tests and their conftest fixtures.
get_allele_translations(db, allele_id, *, as_of=None) returns an allele's full
cross-layer equivalence set (genomic/coding/protein) by traversing the
MappingRecordAllele link graph: allele -> its live links -> mapping record(s) ->
all co-linked alleles.

The relation is co-membership in a MappingRecord's allele set, not a shared
identifier — ClinGen's CAID spans only the nucleotide layers (the protein allele
carries a distinct PA), so the link graph is the only thing tying all layers
together. This replaces what the retired variant_translations PA<->CA table provided.

Forward-compatible with temporal reads: the same half-open valid-time predicate is
applied at both the anchor and fan-out hops, so passing as_of reconstructs the
equivalence set as of any instant. Defaults to the currently-live set.
…bulary

Introduces the AnnotationEvent log: a single append-only event table whose
subjects are Variant and Allele, selected by annotation_type via a polymorphic
CHECK. "Current" is derived (DISTINCT ON … id DESC), never stored. Adds the
shared Disposition (present/absent/not_applicable/failed) and EventReason
vocabulary, the v_current_annotation_events view, and the supporting migrations.
group_alleles_for_annotation now returns one job-specific payload per allele
instead of an AlleleAnnotationGroup. Drops the authoritative_variant_ids bandaid
seam (and is_authoritative from ScoreSetAlleleRow): annotation is an allele-level
fact, and with the AnnotationEvent log the per-variant status fan-out goes away.
AnnotationStatusManager now writes AnnotationEvent rows via record_event and
derives current status from the event log instead of maintaining per-variant
status rows. Resolves subjects through VARIANT_SUBJECT_TYPES/ALLELE_SUBJECT_TYPES
and the Disposition vocabulary.
link_gnomad_variants_to_alleles now returns a GnomadLinkVerdict per touched allele
(CREATED/UNCHANGED) as the single source of truth for per-allele status; the job
derives annotation events directly from the verdict map rather than re-querying
link state. Alleles absent from the map are read as 'gnomAD had no record'.
VEP resolution now returns a VepResolution (consequences vs errored) so a genuine
empty is distinguished from an unknown, and linking returns a VepLinkVerdict per
allele (CREATED/UNCHANGED). The job derives annotation events from these outcomes
instead of re-querying consequence state; errored HGVS are retried, not recorded
as negatives.
The ClinGen allele-registration and LDH-submission jobs now record AnnotationEvents
via record_event, using the shared EventReason vocabulary (created/preexisting/
reconfirmed/submitted, and the CAR-specific failure and reconfirmation-skipped
reasons) in place of per-variant status updates.
The ClinVar job records AnnotationEvents via record_event, distinguishing
superseded/no_record/no_caid/multi_variant_caid cases through the shared
EventReason vocabulary. Clarifies the ClinicalControl comments to mark the
row-level vs variation-level link identifiers.
The mapping job records AnnotationEvents via record_event, deriving disposition
and reason from the typed MappingOutcome (mapped / intronic /
no_protein_consequence / failed) rather than writing per-variant status rows.
The reverse-translation job records AnnotationEvents via record_event, using the
RT-specific EventReason cases (translated / no_coding_transcript /
no_assay_level_hgvs / translation_failed / translation_error /
transcript_unresolved) to distinguish informative negatives from failures.
Adds the pipeline_tracking script for querying pipeline progress from the
AnnotationEvent log, and updates the variant_annotations script to read/write
through the event model.
An allele with an existing CAID always matched the preexisting branch first, so
force_reregister never re-submitted it. Existing-CAID alleles now route to pending
whenever force_reregister is set.
Split the resolution exception handling: an HTTPError (VEP rejecting a batch, e.g.
a 400 for protein HGVS it cannot parse) now routes the batch to Variant Recoder,
which may still resolve it via genomic recoding. Only transport/network errors
(RequestException) are marked errored and held for retry — fixing the case where
recoverable inputs were wrongly recorded as unknown.
normalize_and_identify/_rle_to_lse assume an inlined SequenceReference and integer coordinates, but the VRS models type location.sequenceReference and location.start/rle.length as unions (iriReference/Range/None). Add narrowing assertions documenting that contract — restoring a green `mypy src/` (a required CI job) and turning the unreachable bad-input case into a clear AssertionError instead of a deeper AttributeError. Adds guard tests.
The newest-wins ClinvarAlleleLink supersede stamped valid_to/valid_from with a tz-naive datetime.now() into timestamptz columns, skewing the valid-time axis against every other job's func.now()-stamped rows. Use supersede_with so retire and insert share one DB timestamp. Test now asserts the boundary is timezone-aware.
Every other FK on annotation_event was indexed except score_set_id. Add it (model __table_args__ + migration) to back the ON DELETE SET NULL cascade on score-set deletion and score-set-scoped audit queries.
Revision a1b2c3d4e5f6 creates alleles/mapping_records/mapping_record_alleles — there is no allele-closure table anywhere (equivalence is a runtime query). Rename the file so it stops misleading readers; the revision id is unchanged.
Re-add cases dropped in the manager rewrite: None/empty reads, no-op flush, auto-flush-before-read for both query methods, a FAILED-disposition round-trip through the manager, and per-subject current-status isolation.
…ents

split_cis_phased_hgvs raised ValueError on bracketed input lacking an accession; it now degrades to a single-element list. join_cis_phased_hgvs emits components in coordinate order so CSV exports are deterministic regardless of VRS member ordering (the block digest is order-independent).
@bencap bencap linked an issue Jun 30, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Extend annotation pipeline to cover Allele entities

1 participant